random measure
A Bayesian Nonparametric View on Count-Min Sketch
The count-min sketch is a time-and memory-efficient randomized data structure that provides a point estimate of the number of times an item has appeared in a data stream. The count-min sketch and related hash-based data structures are ubiquitous in systems that must track frequencies of data such as URLs, IP addresses, and language n-grams. We present a Bayesian view on the count-min sketch, using the same data structure, but providing a posterior distribution over the frequencies that characterizes the uncertainty arising from the hash-based approximation. In particular, we take a nonparametric approach and consider tokens generated from a Dirichlet process (DP) random measure, which allows for an unbounded number of unique tokens. Using properties of the DP, we show that it is possible to straightforwardly compute posterior marginals of the unknown true counts and that the modes of these marginals recover the count-min sketch estimator, inheriting the associated probabilistic guarantees. Using simulated data with known ground truth, we investigate the properties of these estimators. Lastly, we also study a modified problem in which the observation stream consists of collections of tokens (i.e., documents) arising from a random measure drawn from a stable beta process, which allows for power law scaling behavior in the number of unique tokens.
- North America > United States > New Jersey > Mercer County > Princeton (0.04)
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
- Europe > Denmark > Capital Region > Kongens Lyngby (0.04)
- Telecommunications > Networks (0.34)
- Information Technology > Networks (0.34)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- Asia > Middle East > Jordan (0.04)
- (3 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.69)
A Bayesian Nonparametric View on Count-Min Sketch
The count-min sketch is a time-and memory-efficient randomized data structure that provides a point estimate of the number of times an item has appeared in a data stream. The count-min sketch and related hash-based data structures are ubiquitous in systems that must track frequencies of data such as URLs, IP addresses, and language n-grams. We present a Bayesian view on the count-min sketch, using the same data structure, but providing a posterior distribution over the frequencies that characterizes the uncertainty arising from the hash-based approximation. In particular, we take a nonparametric approach and consider tokens generated from a Dirichlet process (DP) random measure, which allows for an unbounded number of unique tokens. Using properties of the DP, we show that it is possible to straightforwardly compute posterior marginals of the unknown true counts and that the modes of these marginals recover the count-min sketch estimator, inheriting the associated probabilistic guarantees. Using simulated data with known ground truth, we investigate the properties of these estimators. Lastly, we also study a modified problem in which the observation stream consists of collections of tokens (i.e., documents) arising from a random measure drawn from a stable beta process, which allows for power law scaling behavior in the number of unique tokens.
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
- (3 more...)
- Asia > Afghanistan > Parwan Province > Charikar (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- Asia > Middle East > Jordan (0.04)
- North America > United States > California > Los Angeles County > Los Angeles (0.28)
- North America > United States > California > Orange County > Irvine (0.14)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
- Information Technology > Data Science (0.93)
- Information Technology > Communications > Networks (0.69)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.68)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
- (3 more...)
- Asia > Afghanistan > Parwan Province > Charikar (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- Asia > Middle East > Jordan (0.04)